Search CORE

37 research outputs found

Challenge Results Are Not Reproducible

Author: Grab Georg
Maier-Hein Lena
Reinke Annika
Publication venue
Publication date: 14/07/2023
Field of study

While clinical trials are the state-of-the-art methods to assess the effect of new medication in a comparative manner, benchmarking in the field of medical image analysis is performed by so-called challenges. Recently, comprehensive analysis of multiple biomedical image analysis challenges revealed large discrepancies between the impact of challenges and quality control of the design and reporting standard. This work aims to follow up on these results and attempts to address the specific question of the reproducibility of the participants methods. In an effort to determine whether alternative interpretations of the method description may change the challenge ranking, we reproduced the algorithms submitted to the 2019 Robust Medical Image Segmentation Challenge (ROBUST-MIS). The leaderboard differed substantially between the original challenge and reimplementation, indicating that challenge rankings may not be sufficiently reproducible.Comment: Accepted at BVM 202

arXiv.org e-Print Archive

Deployment of Image Analysis Algorithms under Prevalence Shifts

Author: Christodoulou Evangelia
Ferrer Luciana
Godau Patrick
Jäger Paul
Kalinowski Piotr
Maier-Hein Lena
Reinke Annika
Tizabi Minu
Publication venue
Publication date: 24/07/2023
Field of study

Domain gaps are among the most relevant roadblocks in the clinical translation of machine learning (ML)-based solutions for medical image analysis. While current research focuses on new training paradigms and network architectures, little attention is given to the specific effect of prevalence shifts on an algorithm deployed in practice. Such discrepancies between class frequencies in the data used for a method's development/validation and that in its deployment environment(s) are of great importance, for example in the context of artificial intelligence (AI) democratization, as disease prevalences may vary widely across time and location. Our contribution is twofold. First, we empirically demonstrate the potentially severe consequences of missing prevalence handling by analyzing (i) the extent of miscalibration, (ii) the deviation of the decision threshold from the optimum, and (iii) the ability of validation metrics to reflect neural network performance on the deployment population as a function of the discrepancy between development and deployment prevalence. Second, we propose a workflow for prevalence-aware image classification that uses estimated deployment prevalences to adjust a trained classifier to a new environment, without requiring additional annotated deployment data. Comprehensive experiments based on a diverse set of 30 medical classification tasks showcase the benefit of the proposed workflow in generating better classifier decisions and more reliable performance estimates compared to current practice

arXiv.org e-Print Archive

Ten years of image analysis and machine learning competitions in dementia

Author: Alexander Daniel C.
Bron Esther E.
Klein Stefan
Maier-Hein Lena
Oxtoby Neil P.
Papma Janne M.
Reinke Annika
Publication venue: 'Elsevier BV'
Publication date: 18/02/2022
Field of study

Machine learning methods exploiting multi-parametric biomarkers, especially based on neuroimaging, have huge potential to improve early diagnosis of dementia and to predict which individuals are at-risk of developing dementia. To benchmark algorithms in the field of machine learning and neuroimaging in dementia and assess their potential for use in clinical practice and clinical trials, seven grand challenges have been organized in the last decade. The seven grand challenges addressed questions related to screening, clinical status estimation, prediction and monitoring in (pre-clinical) dementia. There was little overlap in clinical questions, tasks and performance metrics. Whereas this aids providing insight on a broad range of questions, it also limits the validation of results across challenges. The validation process itself was mostly comparable between challenges, using similar methods for ensuring objective comparison, uncertainty estimation and statistical testing. In general, winning algorithms performed rigorous data preprocessing and combined a wide range of input features. Despite high state-of-the-art performances, most of the methods evaluated by the challenges are not clinically used. To increase impact, future challenges could pay more attention to statistical analysis of which factors relate to higher performance, to clinical questions beyond Alzheimer's disease, and to using testing data beyond the Alzheimer's Disease Neuroimaging Initiative. Grand challenges would be an ideal venue for assessing the generalizability of algorithm performance to unseen data of other cohorts. Key for increasing impact in this way are larger testing data sizes, which could be reached by sharing algorithms rather than data to exploit data that cannot be shared.Comment: 12 pages, 4 table

arXiv.org e-Print Archive

EUR Research Repository

UCL Discovery

BIAS: Transparent reporting of biomedical image analysis challenges

Author: Arbel Tal
Eisenmann Matthias
Hanbury Allan
Jannin Pierre
Kopp-Schneider Annette
Kozubek Michal
Landman Bennett A.
Maier-Hein Lena
Martel Anne L.
Müller Henning
Onogur Sinan
Reinke Annika
Saez-Rodriguez Julio
van Ginneken Bram
Publication venue: 'Elsevier BV'
Publication date: 01/01/2020
Field of study

The number of biomedical image analysis challenges organized per year is steadily increasing. These international competitions have the purpose of benchmarking algorithms on common data sets, typically to identify the best method for a given problem. Recent research, however, revealed that common practice related to challenge reporting does not allow for adequate interpretation and reproducibility of results. To address the discrepancy between the impact of challenges and the quality (control), the Biomedical Image Analysis ChallengeS (BIAS) initiative developed a set of recommendations for the reporting of challenges. The BIAS statement aims to improve the transparency of the reporting of a biomedical image analysis challenge regardless of field of application, image modality or task category assessed. This article describes how the BIAS statement was developed and presents a checklist which authors of biomedical image analysis challenges are encouraged to include in their submission when giving a paper on a challenge into review. The purpose of the checklist is to standardize and facilitate the review process and raise interpretability and reproducibility of challenge results by making relevant information explicit

arXiv.org e-Print Archive

Hes-so: ArODES Open Archive (University of Applied Sciences and Arts Western Switzerland / Haute école spécialisée de Suisse occidentale / FH Westschweiz)

HAL-Inserm

Univerzitní repozitář Masarykovy univerzity

Hal-Diderot

HAL-Rennes 1

The Medical Segmentation Decathlon

Author: Antonelli Michela
Arbeláez Pablo
Bae Byeonguk
Bakas Spyridon
Bilello Michel
Bilic Patrick
Christ Patrick F
Do Richard K G
et al
Farahani Keyvan
Gollub Marc J
Heckers Stephan H
Huisman Henkjan
Jarnagin William R
Kopp-Schneider Annette
Landman Bennett A
Litjens Geert
McHugo Maureen K
Meakin James A
Menze Bjoern
Napel Sandy
Ourselin Sebastien
Pernicka Jennifer S Golia
Reinke Annika
Rhode Kawal
Ronneberger Olaf
Summers Ronald M
Tobon-Gomez Catalina
van Ginneken Bram
Vorontsov Eugene
Wiesenfarth Manuel
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 15/07/2022
Field of study

International challenges have become the de facto standard for comparative assessment of image analysis algorithms. Although segmentation is the most widely investigated medical image processing task, the various challenges have been organized to focus only on specific clinical tasks. We organized the Medical Segmentation Decathlon (MSD)—a biomedical image analysis challenge, in which algorithms compete in a multitude of both tasks and modalities to investigate the hypothesis that a method capable of performing well on multiple tasks will generalize well to a previously unseen task and potentially outperform a custom-designed solution. MSD results confirmed this hypothesis, moreover, MSD winner continued generalizing well to a wide range of other clinical problems for the next two years. Three main conclusions can be drawn from this study: (1) state-of-the-art image segmentation algorithms generalize well when retrained on unseen tasks; (2) consistent algorithmic performance across multiple tasks is a strong surrogate of algorithmic generalizability; (3) the training of accurate AI segmentation models is now commoditized to scientists that are not versed in AI model training

ZORA

Comparative validation of machine learning algorithms for surgical workflow and skill analysis with the HeiChole benchmark

Purpose: Surgical workflow and skill analysis are key technologies for the next generation of cognitive surgical assistance systems. These systems could increase the safety of the operation through context-sensitive warnings and semi-autonomous robotic assistance or improve training of surgeons via data-driven feedback. In surgical workflow analysis up to 91% average precision has been reported for phase recognition on an open data single-center video dataset. In this work we investigated the generalizability of phase recognition algorithms in a multicenter setting including more difficult recognition tasks such as surgical action and surgical skill. Methods: To achieve this goal, a dataset with 33 laparoscopic cholecystectomy videos from three surgical centers with a total operation time of 22 h was created. Labels included framewise annotation of seven surgical phases with 250 phase transitions, 5514 occurences of four surgical actions, 6980 occurences of 21 surgical instruments from seven instrument categories and 495 skill classifications in five skill dimensions. The dataset was used in the 2019 international Endoscopic Vision challenge, sub-challenge for surgical workflow and skill analysis. Here, 12 research teams trained and submitted their machine learning algorithms for recognition of phase, action, instrument and/or skill assessment. Results: F1-scores were achieved for phase recognition between 23.9% and 67.7% (n = 9 teams), for instrument presence detection between 38.5% and 63.8% (n = 8 teams), but for action recognition only between 21.8% and 23.3% (n = 5 teams). The average absolute error for skill assessment was 0.78 (n = 1 team). Conclusion: Surgical workflow and skill analysis are promising technologies to support the surgical team, but there is still room for improvement, as shown by our comparison of machine learning algorithms. This novel HeiChole benchmark can be used for comparable evaluation and validation of future work. In future studies, it is of utmost importance to create more open, high-quality datasets in order to allow the development of artificial intelligence and cognitive robotics in surgery

KITopen